Simulation of distributions for the CLT/FLE experiment

Author

Stefano Dalla Bona

1. Fitting a Theoretical Distribution

To determine an appropriate Effect Size (ES) for detecting differences between two groups exposed to a null contingency illusory condition (one in a native language – NL, and one in a foreign language – FL), we first examine the expected distribution of the dependent variable for the NL group. This distribution is based on data from our previous study (Dalla Bona & Vicovaro, 2024). We then extend this analysis to consider the expected differences between the NL and FL groups.

Given that our current Contingency Learning Task (CLT) experiment will be conducted online using PsychoPy, and that our previous experiment was also run online with it (data available at: https://osf.io/c26qa/files/osfstorage), we use data from the prior study as a reference distribution. Specifically, we focus on the control group, which was not exposed to any manipulation of perceptual features, in the null contingency illusory condition. Our objective is to identify which generative distribution most likely underlies the observed empirical data points.

To identify potential candidate distributions, we employ a Cullen and Frey graph that plots kurtosis against the square of skewness (Cullen & Frey, 1999). This visualization helps us assess the distributional characteristics of the data and select the most appropriate generative model.

1.1 Importing Data and Describing It

We begin by loading the data from the Excel file and filtering it to include only participants with valid responses (valido == "si"). The tempoistr variable (time spent on the instructions) is then converted to numeric format, and we exclude any rows where this time is less than 10 seconds, as this was an exclusion criterion in the previous experiment. Next, we extract the causal evaluation scores for participants in the control group who were exposed to the null contingency scenario (identified by gruppo == 1). These scores are stored in the vector_illusion variable.

# Load required libraries
library(readxl)

# Read the data from the Excel file
data <- as.data.frame(read_xlsx("Dataframeclt_online.xlsx"))

# Display the structure of the dataset
str(data)
'data.frame':   249 obs. of  15 variables:
 $ cp1       : chr  "A" "A" "A" "B" ...
 $ cp2       : num  1 2 3 4 5 6 7 8 9 10 ...
 $ REF       : num  1 2 3 5 6 7 8 9 10 11 ...
 $ email     : chr  "mariaconcetta.mercadante@studenti.unipd.it" "ramonadaniela.ciumau@studenti.unipd.it" "alessiaqueraiti02@gmail.com" "giacomo.delgobbo@gmail.com" ...
 $ sesso     : chr  "F" "F" "F" "M" ...
 $ eta       : num  24 24 22 27 25 23 22 23 30 30 ...
 $ valido    : chr  "si" "si" "si" "si" ...
 $ gruppo    : num  3 2 1 2 1 4 4 3 4 3 ...
 $ tempoistr : chr  "33.7546" "33.2288" "67.6089" "138.8366" ...
 $ tempotask : chr  "347.3093" "336.9458" "328.1303" "265.2705" ...
 $ tempovd   : chr  "10.4335" "10.8814" "9.2636" "44.5247" ...
 $ tempocheck: chr  "12.1336" "9.3648" "7.7313" "9.0483" ...
 $ tempotot  : chr  "486.661" "434.2955" "469.9609" "537.1056" ...
 $ vd        : num  59 50 69 40 70 80 75 68 70 80 ...
 $ check     : num  1 6 1 7 2 6 5 1 6 1 ...
# Subset the first 209 rows
data <- data[1:209,]

# Filter data for valid responses
data <- subset(data, data$valido == "si")

# Convert 'tempoistr' to numeric and filter out fast responders (time < 10 seconds)
data$tempoistr <- as.numeric(data$tempoistr)
fast <- which(data$tempoistr < 10)
if (length(fast) > 0) data <- data[-fast, ]  # guard: data[-integer(0), ] would drop every row

# Extract causal evaluation scores for the control group
vector_illusion <- data$vd[data$gruppo == 1]

The causal evaluation is treated as a metric (interval-like) variable, with scores ranging from 0 to 100. For the control group (N = 60), the distribution of the dependent variable is fairly spread out, with a mean of 58.58, a median of 63.5, and a standard deviation of 18.51. The range spans from a minimum of 16 to a maximum of 90, and 50% of the data falls between 49.5 and 73 (1st Quartile = 49.5, 3rd Quartile = 73). The distribution exhibits a moderate negative skew (skewness = -0.63), which may be partially attributable to the data being bounded between 0 and 100. Additionally, the distribution is platykurtic (kurtosis = -0.45), indicating a flatter shape compared to a normal distribution.

To visualize this distribution, we use a combination of plots: the boxplot highlights the quartiles and potential outliers, the half-eye plot provides a kernel density estimate, and the dot plot emphasizes individual data points.

# Load required package for summary statistics
library(pastecs)

# View the length and summary of the vector
length(vector_illusion)
[1] 60
summary(vector_illusion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   49.50   63.50   58.58   73.00   90.00 
# Calculate detailed descriptive statistics
round(stat.desc(vector_illusion, norm = TRUE), 2)
     nbr.val     nbr.null       nbr.na          min          max        range 
       60.00         0.00         0.00        16.00        90.00        74.00 
         sum       median         mean      SE.mean CI.mean.0.95          var 
     3515.00        63.50        58.58         2.39         4.78       342.59 
     std.dev     coef.var     skewness     skew.2SE     kurtosis     kurt.2SE 
       18.51         0.32        -0.63        -1.02        -0.45        -0.37 
  normtest.W   normtest.p 
        0.95         0.01 
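As an aside on reading this output: skew.2SE and kurt.2SE divide each statistic by twice its standard error, so absolute values above 1 suggest a statistic reliably different from zero. A minimal sketch (assuming the usual small-sample standard-error formula for skewness, which pastecs uses) reproduces the printed skew.2SE:

```r
# Standard error of sample skewness for n = 60, and the skew.2SE ratio
n <- 60
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
skew_2se <- -0.63 / (2 * se_skew)  # using the skewness value printed above
round(skew_2se, 2)  # approx. -1.02, matching skew.2SE in the output
```

Since |skew.2SE| is around 1, the negative skew is borderline, consistent with the Shapiro-Wilk result (normtest.p = 0.01) indicating some departure from normality.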
# Load ggplot packages
library(ggdist); library(ggthemes); library(ggplot2)

library(ggokabeito) #Colorblind friendly palette

# Create a visual representation of the causal evaluation distribution
ggplot(mapping = aes(y = vector_illusion, fill = factor(1))) + 
  scale_fill_okabe_ito( alpha = .9) +  # Colorblind-friendly palette
  stat_halfeye(adjust = 0.9, justification = -0.2, .width = 0, point_colour = NA) +  # Half-eye plot
  geom_boxplot(width = 0.12, outlier.color = NA, alpha = 0.5) +  # Boxplot
  stat_dots(side = "left", justification = 1.1, binwidth = 1) +  # Dot plot
  labs(y = "Causal Evaluation", x = "") + 
  coord_flip() +  
  theme_bw() +  
  theme(legend.position = "none") 

1.2 Generative Distributions

In this section, we aim to identify the distribution that best describes our sample data. To do so, we compare our data to various theoretical distributions using the Cullen and Frey graph. This graph plots kurtosis against the square of skewness (Cullen & Frey, 1999), allowing us to visually assess how well our data aligns with different distribution families.

The data are represented as a red point on the graph, indicating how closely the distribution resembles known distribution families. To enhance the analysis, we also include bootstrapped data (generated from the sample) to assess how the distribution holds up under resampling.

To generate the graph and perform the analysis, we use the fitdistrplus package. The following code fits the data to various distributions and plots the Cullen and Frey graph, treating the variable as discrete (assuming the random variable takes countable values).

# Load the required package for distribution fitting
library(fitdistrplus)
Loading required package: MASS
Loading required package: survival
# Generate the Cullen and Frey graph with bootstrapping (1000 resamples)
descdist(vector_illusion, boot = 1000, discrete = TRUE)

summary statistics
------
min:  16   max:  90 
median:  63.5 
mean:  58.58333 
estimated sd:  18.50908 
estimated skewness:  -0.6617163 
estimated kurtosis:  2.710791 

We consider two candidate distributions for the data: the Normal distribution and the Poisson distribution. To determine the most suitable model, we perform a series of diagnostic plots and statistical tests that compare the empirical distribution of the data against the theoretical distributions.

The following diagnostic plots are used:

  • Density comparison (denscomp): Compares the histogram of the empirical data to the fitted density functions of the theoretical distributions.

  • Q-Q plot (qqcomp): Plots the theoretical quantiles of the fitted distributions against the quantiles of the empirical data.

  • CDF comparison (cdfcomp): Compares the cumulative distribution function (CDF) of the empirical data to the CDFs of the fitted distributions.

  • P-P plot (ppcomp): Plots the theoretical cumulative probabilities against the empirical cumulative probabilities.

From these diagnostic plots, we observe that the Normal distribution fits the data better than the Poisson distribution. This observation is further confirmed by the goodness-of-fit tests, which indicate that the Normal distribution is the most probable model for our data. Specifically, the Akaike weight for the Normal distribution is approximately 1, suggesting that it is the best-fitting model.

It is important to note that the Normal distribution fitted to the data is effectively truncated, particularly on the right-hand side. Because the fitted mean (58.58) lies above the midpoint of the 0-100 scale, part of the theoretical distribution's upper tail would fall beyond the boundary of 100; those values are clipped, which produces the negative (left) skew observed in the data. Based on these observations, we assume that the truncated Normal distribution is the most appropriate generative distribution for further analysis.
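For concreteness, the truncated normal density can be written by renormalizing the normal density over the retained interval. The helper below (tnorm_dens, a name introduced here for illustration only, not part of the analysis code) is a minimal sketch:

```r
# Density of a normal(mean, sd) truncated to [a, b]:
# dnorm(x) renormalized by the probability mass retained between the bounds
tnorm_dens <- function(x, mean, sd, a = 0, b = 100) {
  ifelse(x >= a & x <= b,
         dnorm(x, mean, sd) / (pnorm(b, mean, sd) - pnorm(a, mean, sd)),
         0)
}

# Sanity check: the renormalized density integrates to 1 over [0, 100]
integrate(function(x) tnorm_dens(x, 58.58, 18.51), 0, 100)$value
```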

# Fit distributions to the data
fnorm <- fitdist(vector_illusion, "norm")  # Fitting Normal distribution
fpois <- fitdist(vector_illusion, "pois")  # Fitting Poisson distribution

plot.legend <- c("Normal", "Poisson")
par(mfrow = c(2, 2))

# Plot the density comparison 
denscomp(list(fnorm, fpois), legendtext = plot.legend)

# Plot the Q-Q plot comparison 
qqcomp(list(fnorm, fpois), legendtext = plot.legend)

# Plot the CDF comparison 
cdfcomp(list(fnorm, fpois), legendtext = plot.legend)

# Plot the P-P plot comparison 
ppcomp(list(fnorm, fpois), legendtext = plot.legend)

# Perform goodness-of-fit tests
gofstat(list(fnorm, fpois), fitnames = c("Normal", "Poisson"))
Goodness-of-fit statistics
                                Normal    Poisson
Kolmogorov-Smirnov statistic 0.1160484  0.2933251
Cramer-von Mises statistic   0.1692820  1.8065351
Anderson-Darling statistic   1.0540738 30.3656046

Goodness-of-fit criteria
                                 Normal  Poisson
Akaike's Information Criterion 523.4556 743.5827
Bayesian Information Criterion 527.6443 745.6771
# Calculate Akaike weights 
C <- gofstat(list(fnorm, fpois), fitnames = c("Normal", "Poisson"))

# Standard Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
# where delta_i is the AIC difference from the best-fitting model
delta <- C$aic - min(C$aic)
AIC_W_norm <- exp(-delta[1] / 2) / sum(exp(-delta / 2))
AIC_W_pois <- exp(-delta[2] / 2) / sum(exp(-delta / 2))

# Print the Akaike weights for both distributions
as.numeric(AIC_W_norm); signif(as.numeric(AIC_W_pois), 3)
[1] 1
[1] 1.58e-48

2. Simulating Approach to Estimate ES

In this section, we employ a simulation-based approach to estimate a meaningful ES. Specifically, we aim to generate simulated data from a theoretical distribution (the truncated Normal distribution) and hypothesize about the potential meaningful difference between two groups, namely NL (i.e., the control group) and FL (i.e., the treatment group).

2.1 Simulating the Distribution

We have already established that the Normal distribution is the most likely model for our data. However, we must account for the fact that our data is discrete and constrained within the boundaries of 0 and 100. To address these constraints, we simulate the data using a truncated normal distribution, which respects both the boundaries and the characteristics of the data. The truncated normal distribution ensures that all generated values fall within the specified range of 0 to 100.

We will use a custom function, based on the qnorm and runif functions, to generate truncated normal data:

  • qnorm: The quantile function for the normal distribution, which transforms uniformly distributed random numbers into quantiles of the normal distribution.

  • runif: The function used to generate uniformly distributed random numbers, which are then transformed by qnorm into the desired normal distribution.

  • round: Since our data needs to be discrete, we round the simulated values to the nearest integer.

By imposing boundaries on the simulated data, we truncate the distribution: any data points that would fall outside the specified range (0 to 100) are trimmed. Truncation therefore introduces skewness, particularly when the mean is positioned close to one of the boundaries; this behavior is expected when dealing with bounded data.

When the mean parameter passed to the function is more extreme (i.e., closer to a boundary), the simulation yields a distribution whose realized mean is pulled back toward the center of the range, because truncation removes values in the clipped tail. For the same reason, the realized standard deviation (SD) shrinks as the mean parameter moves toward a boundary, since truncation limits the spread of the data.

# Custom function to generate data from a truncated normal distribution
tnorm_f <- function(n, mean, sd, a = 0, b = 100) {
  # Draw uniform values between the CDF evaluated at the bounds,
  # then map them back through the normal quantile function
  qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}

# Simulate data using the truncated normal distribution
set.seed(20242025)
x <- round(tnorm_f(length(vector_illusion), 58.58, 18.51))  # Empirical mean and SD, truncated at 0 and 100

# Plot histogram
ggplot(mapping = aes(x, fill = factor(1))) + 
  geom_histogram(binwidth = 10, color = "black") +
  theme_bw() + 
  scale_fill_okabe_ito(order = 2) +
  theme(legend.position = "none") +
  labs(x = "Values", y = "Frequencies")

# Simulate data with different means to observe the effects of truncation
a <- round(tnorm_f(10e3, 50, 20))  # Mean = 50
b <- round(tnorm_f(10e3, 60, 20))  # Mean = 60
c <- round(tnorm_f(10e3, 70, 20))  # Mean = 70
d <- round(tnorm_f(10e3, 30, 20))  # Mean = 30

# Plot histograms Mean = 50
ggplot(mapping = aes(a, fill = factor(1))) + 
  geom_histogram(binwidth = 10, color = "black") +
  theme_bw() + 
  scale_fill_okabe_ito(order = 3) +
  theme(legend.position = "none") +
  labs(x = "Values", y = "Frequencies")

# Plot histograms Mean = 60
ggplot(mapping = aes(b, fill = factor(1))) + 
  geom_histogram(binwidth = 10, color = "black") +
  theme_bw() + 
  scale_fill_okabe_ito(order = 4) +
  theme(legend.position = "none") +
  labs(x = "Values", y = "Frequencies")

# Plot histograms Mean = 70
ggplot(mapping = aes(c, fill = factor(1))) + 
  geom_histogram(binwidth = 10, color = "black") +
  theme_bw() + 
  scale_fill_okabe_ito(order = 5) +
  theme(legend.position = "none") +
  labs(x = "Values", y = "Frequencies")

# Plot histograms Mean = 30
ggplot(mapping = aes(d, fill = factor(1))) + 
  geom_histogram(binwidth = 10, color = "black") +
  theme_bw() + 
  scale_fill_okabe_ito(order = 6) +
  theme(legend.position = "none") +
  labs(x = "Values", y = "Frequencies")

# Simulate more data with different means to observe the effects of truncation
x <- round(tnorm_f(10000, 50, 20))  # Mean = 50
y <- rnorm(10000, 50, 20)  # Untruncated normal data for comparison

# Calculate means and SDs for truncated and untruncated data
mean(x); sd(x)
[1] 49.7048
[1] 19.04769
mean(y); sd(y)
[1] 49.99585
[1] 20.04542
# Simulate with higher mean to observe the effect of truncation more clearly
x <- round(tnorm_f(10000, 65, 20))  # Mean = 65
y <- rnorm(10000, 65, 20)  # Untruncated normal data for comparison

# Calculate means and SDs for the higher mean
mean(x); sd(x)
[1] 63.1889
[1] 18.2378
mean(y); sd(y)
[1] 64.75427
[1] 19.90675
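The shrinkage observed above can be checked against the closed-form moments of the truncated normal distribution (a sketch based on the standard formulas; tnorm_moments is a helper introduced here, not part of the analysis code):

```r
# Closed-form mean and SD of a normal(mu, sigma) truncated to [a, b],
# with alpha = (a - mu)/sigma and beta = (b - mu)/sigma
tnorm_moments <- function(mu, sigma, a = 0, b = 100) {
  alpha <- (a - mu) / sigma
  beta  <- (b - mu) / sigma
  Z <- pnorm(beta) - pnorm(alpha)                     # retained probability mass
  m <- mu + sigma * (dnorm(alpha) - dnorm(beta)) / Z  # truncated mean
  v <- sigma^2 * (1 + (alpha * dnorm(alpha) - beta * dnorm(beta)) / Z -
                    ((dnorm(alpha) - dnorm(beta)) / Z)^2)  # truncated variance
  c(mean = m, sd = sqrt(v))
}

round(tnorm_moments(65, 20), 2)  # approx. mean 63.24, sd 18.20
```

These values agree with the simulated mean of 63.19 and SD of 18.24 up to sampling noise and the rounding of scores to integers.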

2.2 Comparisons

In this section, we compare how well the truncated normal distribution simulates the empirical data, relative to the normal distribution without boundaries. We visually examine the three distributions (empirical, normal, and truncated normal) and compute the overlap indices between the empirical data and the theoretical distributions. The overlap metric quantifies how much the distributions overlap, providing insight into how well each simulated distribution fits the empirical data.

From this analysis, we observe that the truncated normal distribution with a mean of 60 and a standard deviation of 20 (closely approximating the empirical mean of 58.58 and standard deviation of 18.51) provides a marginally better fit to the data than the normal distribution without boundaries. We retain the truncated normal distribution in any case, since we expect this approach to capture the behavior of the data increasingly well as the mean parameter moves toward more extreme values, because the truncated normal explicitly models how the data behave near the limits of the bounded range.

# Simulate data for normal and truncated normal distributions
set.seed(423)

simulated_norm <- round(rnorm(1000000, mean(vector_illusion), sd(vector_illusion)))  

simulated_trunc <- round(tnorm_f(1000000, 60, 20))  

# Plot the distributions using ggplot

empirical_df <- data.frame(value = vector_illusion, type = "Empirical")
simulated_df <- data.frame(value = simulated_norm, type = "Simulated Norm")
simulated_df1 <- data.frame(value = simulated_trunc, type = "Simulated Trunc")

combined_df <- rbind(empirical_df, simulated_df, simulated_df1)

ggplot(combined_df, aes(x = value, col = type)) +
  geom_density(alpha = 0.5) +  
  scale_color_okabe_ito()  +
  labs(title = "Empirical and Simulated Data Distributions",
       x = "Value",
       y = "Density") +
  theme_bw() +  
  theme(legend.position = "top")  

# Load the required package
library(overlapping)
Loading required package: testthat
# Calculate overlap coefficient with normal distribution
overlap_norm_values <- rep(NA, 1000)
for(i in 1:1000){
  simulated_sample <- round(rnorm(length(vector_illusion), 58.58, 18.41))
  overlap_norm_values[i] <- overlap(list(simulated_sample, vector_illusion), type = "1")[1]
}
median(as.numeric(overlap_norm_values))
[1] 0.8612745
# Calculate overlap coefficient with truncated normal distribution
overlap_trunc_values <- rep(NA, 1000)
for(i in 1:1000){
  simulated_sample <- round(tnorm_f(length(vector_illusion), 60, 20))
  overlap_trunc_values[i] <- overlap(list(simulated_sample, vector_illusion), type = "1")[1]
}
median(as.numeric(overlap_trunc_values))
[1] 0.8797271

2.3 ES estimation

In this analysis, we focus on estimating a meaningful ES in the context of a rating scale ranging from 0 to 100, where 0 represents the lowest possible rating (i.e., “Definitely not”) and 100 the highest (i.e., “Definitely yes”). A 10-point change on this scale can be defined as a meaningful ES, corresponding to a 1/10 shift in the evaluation of causality: the degree to which participants’ evaluations of a causal relationship differ when presented in a FL versus their NL. Such a shift is plausible, reflects a moderate change in the associative strength between the cue and the outcome, and provides a clear benchmark for evaluating the effect of the manipulation.
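As a back-of-envelope benchmark (our own sketch, ignoring truncation for the moment), a 10-point shift on this scale with a common SD of about 20 corresponds to a standardized difference of 0.5:

```r
# Untruncated benchmark: Cohen's d for a 10-point mean shift with pooled SD 20
delta_mu  <- 10   # hypothesized NL - FL mean difference
pooled_sd <- 20   # assumed common SD, based on the fitted distribution
d_benchmark <- delta_mu / pooled_sd
d_benchmark  # 0.5, conventionally a "moderate" effect size
```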

To simulate data for the NL group, we assume that a truncated normal distribution with a mean parameter of 60 and a standard deviation (SD) parameter of 20 is the generative distribution. This parametrization is based on the fit of data from our previous experiment, where the observed mean was 58.58 and the SD was 18.51. Using this distribution, the simulation produced a mean of 58.78 and an SD of 18.57, which closely mirrors the characteristics of the control group data.

For the FL group, we simulate a second truncated normal distribution with the mean parameter set 10 points below the realized NL mean, representing the hypothesized effect, while the SD parameter remains at 20 so that the spread of values in the FL group is similar to that of the NL group. The simulation for the FL group yielded a mean of 49.36 and a standard deviation of 19.08.

When parameterizing a truncated normal distribution, the more extreme the mean parameter, the more the realized mean is pulled inward, and the realized standard deviation also decreases; this is expected, as truncation compresses the distribution. Because the truncated distribution may therefore produce a smaller difference than the target 10-point shift, in the next simulations we slightly adjust the initial parametrization where needed so that the difference in realized means between the two groups remains at least 9.9.

# Load libraries
library(effectsize)

# Simulate data for NL (Mean = 60, SD = 20)
set.seed(4234)  
a <- round(tnorm_f(10000, 60, 20))  

# Simulate data for FL (mean parameter set 10 points below the realized NL mean, SD = 20)
b <- round(tnorm_f(10000, mean(a) - 10, 20))  # FL group 

# Calculate the means and SDs for both groups
mean(a); sd(a)  # NL group
[1] 58.7796
[1] 18.57298
mean(b); sd(b)  # FL group
[1] 49.3576
[1] 19.07942
# Plot the densities 
data <- data.frame(
  value = c(a, b),
  group = factor(rep(c("NL", "FL"), each = length(a)))
)


ggplot(data, aes(x = value, fill = group)) +
  geom_density(alpha = 0.5) +  
  labs(x = "Values", y = "Density") +
  scale_fill_okabe_ito(order=c(7,8)) +
  theme_bw() +  
  theme(legend.title = element_blank())

# Calculate Cohen's d (effectsize was loaded above)
cohens_d(a, b)
Cohen's d |       95% CI
------------------------
0.50      | [0.47, 0.53]

- Estimated using pooled SD.

We are uncertain about the exact mean for the NL group in the upcoming experiment, as some features are expected to change. However, we can assume that the NL group mean will vary within a range of -5 to +5 points around the previously observed value. Because the realized SD of a truncated distribution decreases as the mean parameter moves toward the extremes and increases as it moves toward the center, we use a fixed SD parameter of 20 for all the simulated truncated normal distributions.

The mean of the FL group is set to be 10 points lower than the corresponding NL group mean, and we keep the SD parameter at 20. The ES for each pair of NL and FL distributions is calculated using Cohen’s d. The following code simulates control and treatment group data for a range of NL group means (55 to 65) and, for each NL mean, calculates Cohen’s d as a measure of the ES between the NL and FL groups. The simulation ensures that the realized mean difference between the two groups stays close to the 10-point target (at least 9.9), adjusting the FL group mean parameter as necessary to meet this condition.

The output includes a visualization of how Cohen’s d varies across the range of NL group means, from 55 to 65. The histogram displays the distribution of the calculated effect sizes, and a line plot shows how the effect size changes as the NL group mean shifts. The results suggest that the hypothesized ES is moderate, as in all simulations, Cohen’s d did not drop below 0.5.

# Define the function to simulate data 
calculate_effect_size <- function(NLmean, FLdiff, n = 10000, sd_par = 20) {
  v <- rep(NA, 5)
  control_data <- round(tnorm_f(n, NLmean, sd_par))             # Simulate NL group
  treatment_data <- round(tnorm_f(n, NLmean - FLdiff, sd_par))  # Simulate FL group
  
  # Nudge the FL mean parameter down until the realized difference is at least 9.9
  x <- 0
  while (mean(control_data) - mean(treatment_data) < 9.9) {
    x <- x + 0.0005
    treatment_data <- round(tnorm_f(n, NLmean - (FLdiff + x), sd_par))
  }

  v[1] <- as.numeric(cohens_d(control_data, treatment_data)[1])  # Cohen's d
  v[2] <- mean(control_data)
  v[3] <- sd(control_data)
  v[4] <- mean(treatment_data)
  v[5] <- sd(treatment_data)
  
  return(v)
}

# Define the range of NL group means from 55 to 65
NLmeans <- seq(55, 65, by = 0.1)

# Store the results
mean1 <- rep(NA, length(NLmeans))
mean2 <- rep(NA, length(NLmeans))
sd1 <- rep(NA, length(NLmeans))
sd2 <- rep(NA, length(NLmeans))
cohen <- rep(NA, length(NLmeans))

# Loop through each possible control group mean 
for(i in 1:length(NLmeans)) {
  vector <- calculate_effect_size(NLmeans[i], 10, n = 100000)
  cohen[i] <- vector[1]
  mean1[i] <- vector[2]
  mean2[i] <- vector[4]
  sd1[i] <- vector[3]
  sd2[i] <- vector[5]
}

# Store the results
effect_size_df <- data.frame(
  NLmean = mean1,
  FLmean = mean2,
  NLsd = sd1,
  FLsd = sd2,
  d = cohen
)

round(effect_size_df$NLmean - effect_size_df$FLmean,1)
  [1]  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9 10.0  9.9  9.9
 [16]  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9
 [31]  9.9 10.0  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9 10.0  9.9
 [46]  9.9  9.9  9.9 10.0  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9
 [61]  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9  9.9 10.0
 [76]  9.9  9.9  9.9  9.9 10.0  9.9  9.9  9.9  9.9  9.9  9.9 10.0  9.9  9.9  9.9
 [91]  9.9 10.0  9.9  9.9 10.0  9.9 10.0  9.9  9.9  9.9  9.9
# Distribution of Cohen's d values
ggplot(effect_size_df, aes(x = d)) +
  geom_histogram(fill="darkorange")+
  labs(y = "Frequencies", x = "Cohen's d") +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#  Cohen's d for various control group means
ggplot(effect_size_df, aes(x = NLmean, y = d)) +
  geom_line() +
  geom_point() +
  labs(x = "NL Group Mean", y = "Cohen's d") +
  theme_bw()

In this final simulation, we hypothesize that the treatment group (FL) may show greater variability, not only because of the reduction in the mean of the truncated normal distribution but also because individuals in the FL group might respond to the treatment to differing degrees. To account for this increased variability, we simulate different treatment group standard deviations (SDs), ranging from 20 to 25 with a step size of 0.5.
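Before running this simulation, it is worth noting what a larger FL SD alone does to the benchmark (again a back-of-envelope sketch that ignores truncation): with equal group sizes, the pooled SD of 20 and 25 is about 22.64, so a 10-point shift corresponds to a d of roughly 0.44 rather than 0.5.

```r
# Pooled-SD benchmark with unequal group SDs (equal n), ignoring truncation
sd_nl <- 20
sd_fl <- 25
pooled_sd <- sqrt((sd_nl^2 + sd_fl^2) / 2)  # pooled SD for equal group sizes
d_unequal <- 10 / pooled_sd
round(d_unequal, 2)  # approx. 0.44: extra FL variability dilutes the standardized effect
```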

Finally, we store the results in a dataframe that includes Cohen’s d, which will constitute our prior knowledge about the expected ES for future analyses. We also store the updated mean parameterization of the FL group: in each simulation, the FL group’s mean parameter was adjusted so that the difference between the realized control and treatment group means is always at least 9.9. This updated parameter will be useful for design analysis, as it preserves the required mean difference and allows an accurate representation of the distribution in subsequent simulations.

# Creation of the function
calculate_effect_size_with_varying_treatment_sd <- function(NLmean, NLsd, FLdiff, treatment_sd, n = 10000) {
  # Simulate NL group data
  control_data <- round(tnorm_f(n, NLmean, NLsd))
  # Simulate FL group data with the target mean difference and a varying SD
  treatment_data <- round(tnorm_f(n, NLmean - FLdiff, treatment_sd))
  
  # Nudge the FL mean parameter down until the realized difference is at least 9.9
  x <- 0
  while (mean(control_data) - mean(treatment_data) < 9.9) {
    x <- x + 0.0005
    treatment_data <- round(tnorm_f(n, NLmean - (FLdiff + x), treatment_sd))
  }
  
  # Cohen's d
  d <- as.numeric(cohens_d(control_data, treatment_data, pooled_sd = TRUE)[1])
  
  # Return: d, adjusted FL mean parameter, realized NL mean and SD,
  # NL mean parameter, realized FL mean and SD, FL SD parameter
  return(c(d, NLmean - (FLdiff + x), mean(control_data), sd(control_data),
           NLmean, mean(treatment_data), sd(treatment_data), treatment_sd))
}

# Define the range of NL group means (55 to 65)
control_means <- seq(55, 65, by = 0.5)

# Define the range of FL group standard deviations (20 to 25)
treatment_sds <- seq(20, 25, by = 0.5)

# Fixed NL group SD 
control_sd <- 20

# Vectors to store results
mean1_var <- rep(NA, length(control_means) * length(treatment_sds))
mean1_true <- rep(NA, length(control_means) * length(treatment_sds))
mean2_var <- rep(NA, length(control_means) * length(treatment_sds))
mean2_true <- rep(NA, length(control_means) * length(treatment_sds))
sd1_var <- rep(NA, length(control_means) * length(treatment_sds))
sd2_var <- rep(NA, length(control_means) * length(treatment_sds))
cohen_var <- rep(NA, length(control_means) * length(treatment_sds))
sd_true <- rep(NA, length(control_means) * length(treatment_sds))
counter <- 1

# For each combination of NL mean and FL SD
for (mean_val in control_means) {
  for (treatment_sd_val in treatment_sds) {
    vector <- calculate_effect_size_with_varying_treatment_sd(mean_val, control_sd, 10, treatment_sd_val, n = 100000)
    cohen_var[counter] <- vector[1]
    mean2_true[counter] <- vector[2]  # adjusted FL mean parameter
    mean1_var[counter]  <- vector[3]  # realized NL mean
    sd1_var[counter]    <- vector[4]  # realized NL SD
    mean1_true[counter] <- vector[5]  # NL mean parameter
    mean2_var[counter]  <- vector[6]  # realized FL mean
    sd2_var[counter]    <- vector[7]  # realized FL SD
    sd_true[counter]    <- vector[8]  # FL SD parameter
    
    counter <- counter + 1
  }
}

effect_size_treatment_sd_df <- data.frame(
  NLmean = mean1_var,
  NLmeantrue = mean1_true,
  FLmean = mean2_var,
  FLmeantrue = mean2_true,
  NLsd = sd1_var,
  FLsd = sd2_var,
  d = cohen_var,
  sd = sd_true
)

# Cohen's d distribution
ggplot(effect_size_treatment_sd_df, aes(x = d)) +
  geom_histogram(fill="darkorange")+
  labs(y = "Frequencies", x = "Cohen's d") +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Cohen's d for different NL means and FL SD values
ggplot(effect_size_treatment_sd_df, aes(x = NLmeantrue, y = d, color = factor(sd))) +
  geom_line() +
  geom_point() +
  labs(x = "NL Group Mean", y = "Cohen's d", color = "FL SD") +
  scale_color_viridis_d() +
  theme_bw() + 
  geom_smooth(method = "lm", alpha = 0.2) +
  theme(legend.position = "bottom")
`geom_smooth()` using formula = 'y ~ x'

3. References

  • Arnold J (2024). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 5.1.0, https://CRAN.R-project.org/package=ggthemes.

  • Barrett M (2024). ggokabeito: ‘Okabe-Ito’ Scales for ‘ggplot2’ and ‘ggraph’. R package version 0.1.0.9000, commit e28e8b7a0a3301ac40722fb07ed082bde424bb8f, https://github.com/malcolmbarrett/ggokabeito.

  • Ben-Shachar M, Lüdecke D, Makowski D (2020). effectsize: Estimation of Effect Size Indices and Standardized Parameters. Journal of Open Source Software, 5(56), 2815. doi: 10.21105/joss.02815

  • Cullen, A. C., & Frey, H. C. (1999). Probabilistic techniques in exposure assessment: A handbook for dealing with variability and uncertainty in models and inputs. Springer Science & Business Media.

  • Dalla Bona, S., & Vicovaro, M. (2024). Does perceptual disfluency affect the illusion of causality?. Quarterly Journal of Experimental Psychology, 77(8), 1727-1744. https://doi.org/10.1177/17470218231220928

  • Delignette-Muller, M.L., Dutang, C. (2015). fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software, 64(4), 1-34. DOI 10.18637/jss.v064.i04.

  • Grosjean P, Ibanez F (2024). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.4.2, https://CRAN.R-project.org/package=pastecs.

  • Kay, M. (2024). “ggdist: Visualizations of Distributions and Uncertainty in the Grammar of Graphics.” IEEE Transactions on Visualization and Computer Graphics, 30(1), 414-424. https://doi.org/10.1109/TVCG.2023.3327195.

  • Pastore M, Loro PAD, Mingione M, Calcagnì A (2022). overlapping: Estimation of Overlapping in Empirical Distributions. R package version 2.1, https://CRAN.R-project.org/package=overlapping.

  • Wickham, H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

  • Wickham, H. (2023). readxl: Read Excel Files. R package version 1.4.3, https://CRAN.R-project.org/package=readxl.